# Multimodal visual understanding
## Gemma 3 12b It Quantized.w8a8
An INT8 (w8a8) quantized version of google/gemma-3-12b-it that accepts image and text input and produces text output, suited to efficient inference deployment.
Image-to-Text
Transformers

Published by RedHatAI
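As a rough illustration of the deployment path such a checkpoint targets, the sketch below loads the INT8 weights with vLLM and sends a single image-plus-text prompt. The repo id, image URL, and generation settings are assumptions for illustration, not values taken from the listing.

```python
# Minimal vLLM sketch for an INT8 (w8a8) image-to-text checkpoint.
# Assumptions: the repo id below, a vLLM build with Gemma 3 multimodal
# support, and an example image URL; adjust all three for real use.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/gemma-3-12b-it-quantized.w8a8", max_model_len=4096)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

outputs = llm.chat(messages, SamplingParams(temperature=0.2, max_tokens=64))
print(outputs[0].outputs[0].text)
```
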
## Qwen2.5 VL 32B Instruct Exl2 4.25bpw
Apache-2.0
Qwen2.5-VL-32B-Instruct is the latest vision-language model in the Qwen family, with strong multimodal understanding and generation across images, video, and text; this release is an EXL2 quantization at 4.25 bits per weight.
Image-to-Text
Transformers English

Published by christopherthompson81
## Amoral Gemma3 12B Vision
A vision-enhanced version of soob3123/amoral-gemma3-12B that combines the Gemma3-12B language model with a vision encoder for multimodal tasks.
Image-to-Text
Transformers English

Published by gghfez
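For a Gemma3-based vision model like this, a typical query path is the Transformers "image-text-to-text" pipeline, sketched below. The repo id and image URL are placeholders inferred from the listing, not verified values.

```python
# Image-to-text with the Transformers "image-text-to-text" pipeline.
# The repo id is assumed from the listing name; swap in the actual one.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="gghfez/amoral-gemma3-12B-vision",  # assumed repo id
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "What is shown in this photo?"},
        ],
    }
]

result = pipe(text=messages, max_new_tokens=64)
# With chat-style input the pipeline returns the full conversation;
# the last turn holds the model's reply.
print(result[0]["generated_text"][-1]["content"])
```
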
## Qwen2 VL 72B Instruct GGUF
Other
A GGUF quantized version of Qwen2-VL-72B-Instruct for multimodal image-and-text-to-text generation; it can be run with LlamaEdge.
Image-to-Text
Transformers English

Published by second-state
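LlamaEdge serves GGUF models behind an OpenAI-compatible API, so once the server is running the model can be queried with any OpenAI-style client, as in the sketch below. The port, served model name, and image URL here are assumptions, not values from the model card.

```python
# Query a locally running LlamaEdge API server (OpenAI-compatible endpoint).
# Assumptions: server on localhost:8080, model registered as
# "Qwen2-VL-72B-Instruct", and an example image URL.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen2-VL-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Summarize what this chart shows."},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```
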